Implement retry state machine aligned with Kotlin/Swift#144
Implement retry state machine aligned with Kotlin/Swift#144MichaelGHSeg wants to merge 5 commits into
Conversation
Port the RetryStateMachine architecture from analytics-kotlin to align retry behavior across all three SDKs. The state machine classifies HTTP status codes, tracks per-batch failure counts with exponential backoff, handles 429 rate limiting with Retry-After support, and accepts dynamic configuration from CDN settings (httpConfig). Key changes: - Add Retry/ module: RetryStateMachine, RetryState, RetryConfig, RetryTypes, TimeProvider, RetryStateStorage, HttpConfigParser - Update EventPipeline and SyncEventPipeline upload loops to consult the state machine before/after each upload - Add X-Retry-Count header and Retry-After parsing to HTTPClient - Wire SegmentDestination to parse httpConfig from CDN settings - Update e2e-cli with retry-enabled pipeline provider and poll loop - Add 47 unit tests and enable retry + retry-settings e2e suites (68 e2e tests passing)
MichaelGHSeg
left a comment
There was a problem hiding this comment.
Note
Self-Review
Posted by PR author (MichaelGHSeg) — self-reviews cannot apply the APPROVED label. Intended verdict: Manual Review Needed plugin-review for the APPROVED label.
Code Review Personae Verdict: Manual Review Needed ⚠️
Strong port of the Kotlin RetryStateMachine — the state-machine logic (HandleResponse, ShouldUploadBatch, ShouldDeleteBatch, backoff math, 429/Retry-After handling, maxRetryCount semantics) matches the Kotlin reference and is correct. Two issues warrant changes before merge: (1) the hand-rolled JSON state serializer escapes string keys on write but never unescapes on read, so on Windows (where batch keys are absolute file paths with backslashes) persisted retry metadata never round-trips — orphaning backoff timers and the max-retry drop counter after restart; (2) the e2e harness can mask a genuinely dropped batch as a passing test.
Several medium IO/threading items are worth addressing for alignment with Kotlin.
Warning
Required Actions
- Unescape JSON string keys/values on read in RetryStateStorage (or use a real JSON codec) — Windows path keys currently fail to round-trip (Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:191)
- Stop clearing deliveryErrors mid-poll in e2e-cli — a dropped batch can be masked as success when other batches are still pending (e2e-cli/Program.cs:261)
📝 Deep Code Review (3 Passes) — 2 High, 5 Medium, 3 Low ❌
Three independent bug-finder passes plus a manual verification of the top claims against the repo and the Kotlin reference.
Consensus (found by 2+ passes):
- RetryStateStorage escape/unescape round-trip — flagged High by passes 1 & 2, and verified:
FileEventStream.Read()returnsFileInfo.FullName(absolute paths), so on Windows batch keys contain backslashes.EscapeJsonStringdoubles them on write butParseBatchMetadata/ParseJsonFieldsextract keys with rawSubstringand never unescape — the loaded key no longer matches the live URL. - Per-batch
SaveRetryStateon the network dispatcher — flagged by passes 2 & 3. Kotlin wrapssaveRetryStateinwithContext(fileIODispatcher)(EventPipeline.kt:273); the C# port writes the full prefs blob inline on NetworkIODispatcher, once per batch per flush. - Unsynchronized
_retryStateMachineswap — flagged by passes 1 & 2.UpdateHttpConfig(AnalyticsDispatcher) reassigns the field whileUpload()(NetworkIODispatcher) reads it; no volatile/lock.
Verified correct (no action): backoff overflow is capped (Math.Min vs Infinity → maxMs, no NaN), MaxTotalBackoffDuration*1000 does not overflow long, _retryState is correctly preserved across UpdateHttpConfig (matches Kotlin), Retry-After=0 yields immediate retry, exception-as-500 handling matches Kotlin.
Findings
- 🔴 HIGH: [Found by 2/3 passes] State keys escaped on write but never unescaped on read — Windows path keys break round-trip (
Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:191)
Batch keys are full file paths:FileEventStream.Read()returns_directory.GetFiles().Select(f => f.FullName)(EventOutputStream.cs:222). On Windows these contain backslashes (e.g.C:\Users\x\...\segment.events.123.tmp).SerializeStatecorrectly callsEscapeJsonString(kvp.Key)(line 61/76), turning\into\\. ButDeserializeStatenever reverses it:ParseBatchMetadata(line 191) andParseJsonFields(line ~147) extract keys/values with rawjson.Substring(...)andIndexOf('"'). So a key written fromC:\Users\...is read back with doubled backslashes and no longer equals the liveurlpassed toShouldUploadBatch/GetRetryCount. After any save→load round-trip (process restart; the in-loop SaveRetryState then a fresh pipeline's LoadRetryState), per-batch FailureCount/NextRetryTime/FirstFailureTime are orphaned: backoff timers reset and the per-batch max-retry drop limit never trips, so a permanently-failing batch retries forever. Additionally,IndexOf('"')is not escape-aware, so a key containing an escaped quote terminates early and corrupts the rest of the parse. Kotlin avoids this by using kotlinx.serialization (a real JSON codec).
Suggestion: Add anUnescapeJsonStringhelper (reverse ofEscapeJsonString:\\→\,\"→") and apply it to every parsed key and string value; make the closing-quote scan skip the char after a backslash. Add a round-trip unit test with a Windows-style path key likeC:\\a\\b.tmp. - 🔴 HIGH: deliveryErrors.Clear() in poll loop can mask a genuine batch drop (
e2e-cli/Program.cs:261)
When a batch is dropped (DropBatchDecision, or a 4xx/non-2xx with ShouldDeleteBatch==true), the pipeline callsReportInternalError→CapturingErrorHandlerappends todeliveryErrors, and removes the file in the SAME cycle. The poll loop callsdeliveryErrors.Clear()on every cycle wherepending.Count > 0(line 261). So if a batch is genuinely dropped while other batch files are still pending, the next poll sees pending>0, wipes the drop error, and once the survivors upload the harness reports success=true — masking a real dropped/rejected batch as a passing test. The single-batch case works only because pending hits 0 in the same cycle the error fires, skipping the clear.
Suggestion: Don't clear deliveryErrors during the poll loop. Track a separate never-cleared dropped-batch count (or accumulate+dedupe errors) and treat any drop as failure regardless of whether other batches later succeed. - 🟡 MEDIUM: [Found by 2/3 passes] SaveRetryState rewrites the full prefs file per batch on the network thread (diverges from Kotlin) (
Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:242)
RetryStateStorage.SaveRetryState→storage.WritePrefs→UserPrefs.Putserializes the entire prefs cache and rewrites the file (with a .bak copy). The Upload loop calls it once per batch — both the drop path (line 186) and the post-response path (line 242) — even on a plain 2xx success where nothing changed. For N pending batches that is N full-file writes per flush, and the e2e poll loop re-flushes every interval. Kotlin moves this off the network thread viawithContext(fileIODispatcher) { storage.saveRetryState(...) }(EventPipeline.kt:273-274). Correctness is fine; this is an IO/throughput regression and a dispatcher divergence. Mirror exists in SyncEventPipeline.cs:267.
Suggestion: Marshal SaveRetryState onto FileIODispatcher (Scope.WithContext) as Kotlin does, and/or save once at the end of the loop or only when _retryState actually changed (skip the legacy-mode success path). - 🟡 MEDIUM: [Found by 2/3 passes] Unsynchronized _retryStateMachine swap across dispatchers (
Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:81)
_retryStateMachineand_retryStateare plain mutable fields (no lock/volatile).UpdateHttpConfig(line 76-82) runs on AnalyticsDispatcher (SegmentDestination.Update is launched there) and swaps_retryStateMachine, whileUpload()reads it repeatedly per batch on NetworkIODispatcher. Reference assignment is atomic on the CLR (no torn pointer/crash), but there is no happens-before barrier: the upload thread can keep using the stale (legacy) machine for an unbounded time, or switch machines mid-flush between batches, yielding inconsistent decisions in one cycle. Kotlin has the samevarpattern, but C#'s distinct dispatchers make the race observable. Note:_retryStateitself IS correctly preserved across UpdateHttpConfig (matches Kotlin).
Suggestion: Mark_retryStateMachinevolatile, or read it once into a local at the top of each upload iteration and use that local for all calls in the batch. - 🟡 MEDIUM: PipelineState persisted as raw ordinal int; fragile and silently degrades to Ready (
Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:48)
SerializeStatewrites(int)state.PipelineState(line 48) andDeserializeStateonly sets RateLimited when the parsed int == 1 (line 90). This couples the on-disk format to the enum declaration order (Ready=0, RateLimited=1). If a value is ever inserted before RateLimited, a previously-persisted RateLimited state deserializes as Ready, so a rate-limited pipeline resumes uploading immediately after restart and ignores the server's Retry-After window. Kotlin serializes the enum by name (order-independent). Low likelihood (enums are append-only by convention) but it controls rate-limit safety.
Suggestion: Serialize/deserialize PipelineState by name ("Ready"/"RateLimited") rather than ordinal, matching Kotlin. - 🟡 MEDIUM: ReportInternalError fires on every routine non-2xx drop, surfacing benign rejections as errors (
Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:222)
On any non-success response where ShouldDeleteBatch returns true (e.g. a plain 400/413, or a 3xx classified as Drop), the loop callsReportInternalError(NetworkServerRejected, "HTTP {code}: batch rejected by server")(line 222-223). A batch the server legitimately rejects (the correct terminal outcome) is reported as an error. Combined with the e2e harness's new deliveryErrors-based failure gate, a final-cycle drop fails the run even though the SDK behaved correctly and no files remain. The old Upload() path also reported on 4xx, so this is not strictly new, but the interaction with the e2e gate is.
Suggestion: In e2e-cli base success purely on remaining pending files (a dropped batch is a completed delivery decision). Optionally in the pipeline only report unexpected drops, not expected 4xx rejections. - 🟡 MEDIUM: Fast 200 success before first poll causes CLI to wait the full timeout (
e2e-cli/Program.cs:243)
everSeenPendingonly becomes true ifPendingUploads()is non-empty during a poll. With the 2s initThread.Sleepplus a low flushAt, a fast 200 upload can complete before the first poll (300ms in). everSeenPending stays false, the early-break (stableEmptyCount >= 2) is never reached, and the loop runs until the deadline (default 20s). The result is correct (remaining empty, no errors → success) but every fast-success test pays the full timeout, slowing the whole suite.
Suggestion: Capture the initial pending count right after the explicit Flush() before polling, or allow the stable-empty early-break even when everSeenPending is false once a flush has been issued. - 🔵 LOW: CDN httpConfig update is ignored for custom IEventPipeline implementations (
Analytics-CSharp/Segment/Analytics/Plugins/SegmentDestination.cs:92)
Update() casts_pipelineto EventPipeline and SyncEventPipeline (lines 92/95) and calls UpdateHttpConfig on whichever matches. Any custom IEventPipeline supplied via Configuration.EventPipelineProvider silently ignores CDN httpConfig and stays in its constructed mode. The e2e-cli provider returns a concrete EventPipeline so it's unaffected, but third-party pipelines won't receive CDN-driven enable/disable. Correctness gap, not a crash.
Suggestion: Add UpdateHttpConfig (or an httpConfig setter) to the IEventPipeline interface, or document that custom pipelines must consume httpConfig themselves. - 🔵 LOW: Legacy Upload() lost 3xx error reporting (now effectively dead in the pipeline) (
Analytics-CSharp/Segment/Analytics/Utilities/HTTPClient.cs:99)
The rewrittenUpload()no longer reports NetworkUnexpectedHttpCode for 3xx (the old switch did); 3xx now falls through toreturn false(keep for retry). Upload() is public API but the pipeline now exclusively uses UploadWithResponse(), so this only affects external callers of Upload(). Minor behavior change.
Suggestion: If Upload() must preserve old semantics for external callers, restore the 3xx ReportInternalError branch; otherwise consider marking Upload() obsolete. - 🔵 LOW: Programmatically-built RetryConfig path never calls Validated() (aligned with Kotlin) (
Analytics-CSharp/Segment/Analytics/Retry/RetryConfig.cs:304)
The CDN path (HttpConfigParser.Parse) calls Validated(), but the pipeline constructors build RetryConfig directly from HttpConfig.RateLimitConfig/BackoffConfig without re-validating, and e2e-cli constructs configs from raw test JSON (unclamped). A 0 or negative maxRetryCount makes ShouldUploadBatch drop every batch on the first attempt (GlobalRetryCount 0 >= MaxRetryCount <=0). This matches the Kotlin reference exactly (Kotlin also skips validated() in init/updateHttpConfig and builds e2e configs raw), so it is aligned behavior — noting only as a defensive-clamp opportunity at the input boundary.
Suggestion: Optionally clamp maxRetries in e2e-cli (Math.Max(1, ...)) to avoid confusing test results; leaving the library aligned with Kotlin is fine.
📝 Policy Compliance Review — No violations (3 policies evaluated)
Evaluated 3 policies (POL-001-port-forward-to-prod-db, POL-002-prod-db-write-from-laptop, POL-003-credentials-in-pr) across all review surfaces (24 surfaces: PR body, commit messages, one code surface per changed file). No violations detected. No loader warnings.
Review Stages Executed
- ✅ Deep Code Review (3 Passes)
- ✅ Policy Compliance Review
Generated by code-review-personae v0.8.1 | Deep Code Review (3 Passes), Policy Compliance Review
| if (parsedConfig != null) | ||
| { | ||
| EventPipeline concretePipeline = _pipeline as EventPipeline; | ||
| concretePipeline?.UpdateHttpConfig(parsedConfig); |
There was a problem hiding this comment.
Note
🔵 LOW: CDN httpConfig update is ignored for custom IEventPipeline implementations
Update() casts _pipeline to EventPipeline and SyncEventPipeline (lines 92/95) and calls UpdateHttpConfig on whichever matches. Any custom IEventPipeline supplied via Configuration.EventPipelineProvider silently ignores CDN httpConfig and stays in its constructed mode. The e2e-cli provider returns a concrete EventPipeline so it's unaffected, but third-party pipelines won't receive CDN-driven enable/disable. Correctness gap, not a crash.
Suggestion: Add UpdateHttpConfig (or an httpConfig setter) to the IEventPipeline interface, or document that custom pipelines must consume httpConfig themselves.
There was a problem hiding this comment.
Addressed in eda2991 by documenting the contract rather than expanding the public API.
Why not add UpdateHttpConfig to the interface: HttpConfig/RateLimitConfig/BackoffConfig are all internal, so putting a typed UpdateHttpConfig(HttpConfig) on the public IEventPipeline would force the whole config hierarchy public; and adding any member to a public interface is source-breaking for third-party implementers (default interface methods aren't available on netstandard2.0). For alignment, Kotlin doesn't expose this extension point at all — its pipeline field is the concrete EventPipeline, not an interface — so the gap is specific to C#'s extra abstraction.
Added an XML doc on IEventPipeline and an inline note at the cast site in SegmentDestination.Update explaining that custom pipelines must read settings.Integrations.GetJsonObject("Segment.io").GetJsonObject("httpConfig") themselves.
RetryStateStorage: batch keys are absolute file paths and on Windows contain backslashes. SerializeState escaped them but DeserializeState never unescaped, so keys failed to round-trip and per-batch retry metadata was orphaned after a restart. Add UnescapeJsonString and an escape-aware closing-quote scan; apply to all parsed keys and string values. Adds round-trip tests for Windows paths and embedded quotes. e2e-cli: stop clearing deliveryErrors inside the poll loop. The pipeline only reports an error on a permanent drop (never on transient retries), so clearing while other batches are pending could mask a genuine drop as a passing test.
…um persistence EventPipeline/SyncEventPipeline (M1): marshal SaveRetryState onto the FileIODispatcher via Scope.WithContext instead of writing the full prefs file inline on the network thread, matching Kotlin's withContext(fileIODispatcher) (EventPipeline.kt:273). EventPipeline/SyncEventPipeline (M2): mark _retryStateMachine volatile and snapshot it into a local once per flush cycle, so a concurrent UpdateHttpConfig swap on the AnalyticsDispatcher can't yield inconsistent upload decisions across batches in the same cycle. RetryStateStorage (M3): persist PipelineState by name (Ready/RateLimited) instead of the raw ordinal, decoupling the on-disk format from enum order and matching Kotlin. Still reads the legacy "1" ordinal for upgrades. e2e-cli (M5): allow the stable-empty early-break even when pending was never observed (a fast 200 can complete before the first poll), using an extra stable poll to absorb the async-write window — avoids paying the full timeout on every fast-success test. Declined M4 (ReportInternalError on routine drops): current behavior matches Kotlin and the e2e contract requires a dropped batch to report failure (http-status-codes.test.ts:331). Adds tests for by-name PipelineState round-trip and legacy ordinal load.
SegmentDestination.Update applies CDN-driven httpConfig only to the built-in EventPipeline/SyncEventPipeline (via as-cast). A custom IEventPipeline supplied through Configuration.EventPipelineProvider silently stays in its constructed retry mode. Document the contract on IEventPipeline and at the cast site rather than expanding the public API surface (HttpConfig and friends are internal, and adding a method to the public interface would source-break existing implementers).
Summary
RetryStateMachinearchitecture from analytics-kotlin to align retry behavior across all three SDKsEventPipelineconsults before/after each upload — no in-loop retry; failed batches stay on disk and are retried on the next flush cyclehttpConfigwithrateLimitConfigandbackoffConfigsub-configs (presence implies enabled unless explicitlyfalse)enabled: false(legacy mode) until CDN settings arriveNew retry behavior (when enabled):
httpConfigupdates reconfigure the state machine without losing stateFiles added:
Analytics-CSharp/Segment/Analytics/Retry/— RetryStateMachine, RetryState, RetryConfig, RetryTypes, TimeProvider, RetryStateStorage, HttpConfigParserTests/Retry/— 47 unit tests covering state machine logic, storage round-trips, and config parsingFiles modified:
EventPipeline.cs/SyncEventPipeline.cs— upload loop consults state machineHTTPClient.cs— X-Retry-Count header, Retry-After parsing,UploadWithResponsemethodSegmentDestination.cs— parseshttpConfigfrom CDN settingse2e-cli/— retry-enabled pipeline provider, flush/poll loop for e2e testingTest plan
basicsuite: 2/2 passingretrysuite: 46/46 passingretry-settingssuite: 20/20 passinghttpConfig